feat: support planning cleanup #7147

Open
yanghua wants to merge 5 commits into lance-format:main from yanghua:cleanup-plan

Conversation

@yanghua

@yanghua yanghua commented Jun 8, 2026

Collaborator

Background / Motivation

cleanup_old_versions today behaves as a black box: callers hand in a policy
and either get a RemovalStats back or, on failure, a partially mutated
dataset with no record of which files were inspected, kept, or deleted.

That opacity becomes painful in three scenarios we hit in production:

  1. Operational dry-run. Operators want to know exactly which files an
    upcoming cleanup will remove (and how many bytes that frees) before
    actually running it, especially on tables with 100k+ fragments where a
    mistaken policy could remove tens of GB.
  2. Auditing and reproducibility. When a cleanup is triggered automatically
    (commit hooks, schedulers), there is no artifact we can inspect afterwards
    to answer "why did this file go away?". The tracing audit log helps, but
    only if you were already capturing it.
  3. Two-phase execution. Some deployments want to plan on one node and
    execute on another (or in a maintenance window), which the current API
    does not support at all.

This PR splits cleanup into an explicit plan and execute pair, while
keeping the existing cleanup_old_versions entry point byte-for-byte
compatible. The plan is a serializable description of every file the cleanup
intends to delete, the reason it qualifies, and the dataset snapshot it was
built from.
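
The plan/execute split described above can be sketched with a toy in-memory dataset. This is only an illustration under stated assumptions: `ToyDataset`, `ToyPlan`, `ToyPlannedFile`, `ToyStats`, `toy_plan`, and `toy_execute` are hypothetical names and are not the types or signatures added by this PR.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a dataset (hypothetical, not Lance's `Dataset`).
/// Maps file path -> (size in bytes, last version that references the file).
#[derive(Debug, Clone)]
pub struct ToyDataset {
    pub latest_version: u64,
    pub files: BTreeMap<String, (u64, u64)>,
}

/// One file the plan intends to delete, with the reason it qualifies.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyPlannedFile {
    pub path: String,
    pub size_bytes: u64,
    pub reason: String,
}

/// The plan records the snapshot (version) it was built from.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyPlan {
    pub read_version: u64,
    pub files: Vec<ToyPlannedFile>,
}

#[derive(Debug, Default, PartialEq)]
pub struct ToyStats {
    pub files_removed: u64,
    pub bytes_removed: u64,
}

/// Phase 1: read-only. Lists what would be deleted and why; mutates nothing.
pub fn toy_plan(ds: &ToyDataset, oldest_version_to_keep: u64) -> ToyPlan {
    let mut files = Vec::new();
    for (path, (size, last_ref)) in &ds.files {
        if *last_ref < oldest_version_to_keep {
            files.push(ToyPlannedFile {
                path: path.clone(),
                size_bytes: *size,
                reason: format!(
                    "last referenced by version {}, older than {}",
                    last_ref, oldest_version_to_keep
                ),
            });
        }
    }
    ToyPlan { read_version: ds.latest_version, files }
}

/// Phase 2: refuses to run if the dataset moved on since planning,
/// then deletes exactly the planned files.
pub fn toy_execute(ds: &mut ToyDataset, plan: &ToyPlan) -> Result<ToyStats, String> {
    if ds.latest_version != plan.read_version {
        return Err(format!(
            "plan was built at version {}, dataset is now at {}",
            plan.read_version, ds.latest_version
        ));
    }
    let mut stats = ToyStats::default();
    for f in &plan.files {
        if let Some((size, _)) = ds.files.remove(&f.path) {
            stats.files_removed += 1;
            stats.bytes_removed += size;
        }
    }
    Ok(stats)
}
```

In this sketch a dry run is simply a call to `toy_plan` followed by summing `size_bytes` over `plan.files`, without ever calling `toy_execute`.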

What's in this PR

  • New public APIs:
    • plan_cleanup(&Dataset, CleanupPolicy) -> CleanupPlan
    • cleanup_with_plan(&Dataset, CleanupPlan) -> RemovalStats
  • CleanupPlan / CleanupFile / CleanupFileKind / CleanupFileReason /
    CleanupPlanStats / CleanupReferencedBranch data types.
  • Internal refactor of CleanupTask into three explicit execution paths
    (cleanup_old_versions / cleanup_with_plan / commit hooks), each with
    its own trust model documented in-source.
  • cleanup_with_plan validates the dataset URI and base path, checks that
    every planned path stays under the dataset base, and confirms that the
    plan's read_version still matches the storage-resolved latest version.
    A residual TOCTOU window between the version check and the deletes is
    documented in the rustdoc; callers running concurrent writers must
    serialize externally.
  • Plan creation resolves the latest version from storage rather than from
    the in-memory dataset handle, so plans built from a stale handle are
    still safe.
  • Listing-consistency guard: planning fails if list_manifest_locations
    did not return the storage-resolved latest version (defends against
    eventual-consistency or racing list output).
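
Two of the checks listed above, path containment and the listing-consistency guard, can be sketched as follows. This is a simplified illustration and not the code in cleanup.rs: `is_under_base` and `check_listing_has_latest` are hypothetical helpers, and paths are modeled as plain `/`-separated strings.

```rust
/// Returns true only when `candidate` lives strictly below `base`.
/// Paths are compared segment by segment, so "bucket/table2/x" is not
/// treated as being under "bucket/table".
pub fn is_under_base(base: &str, candidate: &str) -> bool {
    let base_segs: Vec<&str> = base.split('/').filter(|s| !s.is_empty()).collect();
    let cand_segs: Vec<&str> = candidate.split('/').filter(|s| !s.is_empty()).collect();
    // Refuse relative segments outright instead of trying to normalize them.
    if cand_segs.iter().any(|s| *s == "." || *s == "..") {
        return false;
    }
    cand_segs.len() > base_segs.len() && cand_segs[..base_segs.len()] == base_segs[..]
}

/// Fails when a manifest listing does not contain the version that storage
/// reports as latest, which would suggest a stale or racing listing.
pub fn check_listing_has_latest(
    listed_versions: &[u64],
    storage_latest: u64,
) -> Result<(), String> {
    if listed_versions.contains(&storage_latest) {
        Ok(())
    } else {
        Err(format!(
            "listing is missing latest version {}; refusing to plan",
            storage_latest
        ))
    }
}
```

Comparing whole segments rather than using a raw string prefix is the design choice worth noting here: a prefix test alone would accept a sibling directory whose name merely starts with the base name.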

Behavior changes worth flagging

  • RemovalStats returned by cleanup_old_versions now includes the stats
    of cascaded clean_referenced_branches cleanups. Previously those were
    silently dropped. Monitoring/dashboards that compared against the old
    numbers will see an increase.
  • cleanup_with_plan will reject a plan if any commit lands between
    plan_cleanup and cleanup_with_plan on the same dataset. This is by
    design; see the rustdoc. The internal cleanup_old_versions path is
    unaffected.
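
The first behavior change above, folding cascaded branch cleanups into the returned stats, can be illustrated with a small sketch. `ToyRemovalStats` and both functions are hypothetical; the field names are illustrative and are not taken from the real RemovalStats.

```rust
use std::ops::AddAssign;

/// Hypothetical stand-in for the stats type.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct ToyRemovalStats {
    pub bytes_removed: u64,
    pub files_removed: u64,
}

impl AddAssign for ToyRemovalStats {
    fn add_assign(&mut self, other: Self) {
        self.bytes_removed += other.bytes_removed;
        self.files_removed += other.files_removed;
    }
}

/// Old behavior as described: branch cleanup stats are dropped.
pub fn report_main_only(
    main: ToyRemovalStats,
    _branches: &[ToyRemovalStats],
) -> ToyRemovalStats {
    main
}

/// New behavior as described: branch cleanup stats are added to the total.
pub fn report_with_branches(
    main: ToyRemovalStats,
    branches: &[ToyRemovalStats],
) -> ToyRemovalStats {
    let mut total = main;
    for b in branches {
        total += *b;
    }
    total
}
```

A dashboard that alerts on a fixed bytes-removed threshold would see the larger of these two totals after upgrading, even when the same files are deleted.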

Tests

  • plan_cleanup_does_not_delete_files
  • plan_cleanup_uses_latest_version_with_stale_handle
  • cleanup_with_plan_rejects_stale_version
  • cleanup_with_plan_rejects_toctou_commit_with_stale_handle
  • internal_cleanup_plan_allows_toctou_commit_before_delete
  • process_manifests_rejects_listing_missing_latest_version
  • All existing cleanup_old_versions / cleanup_with_policy tests continue
    to pass unmodified.

github-actions bot added the enhancement label Jun 8, 2026
@codecov

codecov Bot commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 89.98073% with 52 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/cleanup.rs | 91.74% | 27 Missing and 15 partials ⚠️ |
| rust/lance/src/dataset.rs | 0.00% | 10 Missing ⚠️ |

@yanghua

yanghua commented Jun 8, 2026

Collaborator Author

@claude review

Comment thread rust/lance/src/dataset/cleanup.rs
Comment thread rust/lance/src/dataset/cleanup.rs
@yanghua yanghua marked this pull request as ready for review June 8, 2026 11:38

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@yanghua yanghua requested a review from Xuanwo June 8, 2026 13:57

@Xuanwo Xuanwo left a comment

Collaborator

I think this should follow SQL EXPLAIN semantics: the cleanup plan should be a dry-run/audit report, not a materialized deletion plan.

Execution should re-evaluate cleanup from the current dataset/ref state instead of trusting an old file list. For example, a tag or branch can be added after planning without advancing the manifest version, so the current read_version check can still pass while the old plan deletes files that are now protected.
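
The tag scenario can be made concrete with a toy model. This sketch is an assumption-laden illustration and not Lance code: `ToyRefs`, `deletable_files`, and `version_check_passes` are hypothetical, and "protected" here simply means referenced by the latest version or by any tagged version.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy view of dataset and ref state (hypothetical).
pub struct ToyRefs {
    pub latest_version: u64,
    pub tagged_versions: BTreeSet<u64>,
    /// version -> files referenced by that version's manifest
    pub manifests: BTreeMap<u64, Vec<String>>,
}

/// Files referenced only by versions that are neither latest nor tagged.
pub fn deletable_files(state: &ToyRefs) -> BTreeSet<String> {
    let mut protected = BTreeSet::new();
    let mut candidates = BTreeSet::new();
    for (version, files) in &state.manifests {
        let keep = *version == state.latest_version
            || state.tagged_versions.contains(version);
        for f in files {
            if keep {
                protected.insert(f.clone());
            } else {
                candidates.insert(f.clone());
            }
        }
    }
    candidates.difference(&protected).cloned().collect()
}

/// The kind of check a materialized plan relies on: only the manifest
/// version is compared, so ref changes are invisible to it.
pub fn version_check_passes(planned_read_version: u64, state: &ToyRefs) -> bool {
    planned_read_version == state.latest_version
}
```

In this model, planning with no tags marks a file referenced only by version 1 as deletable. Tagging version 1 afterwards leaves `latest_version` unchanged, so `version_check_passes` still returns true, while calling `deletable_files` again at execution time no longer returns that file.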

Labels

enhancement New feature or request

2 participants